{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f5f0bee2",
   "metadata": {},
   "source": [
    "## Network issues"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5dfbe312",
   "metadata": {},
   "source": [
    "```\n",
    "import os\n",
    "\n",
    "# Route HTTP(S) requests through a local proxy; adjust host and port to your setup\n",
    "os.environ['HTTP_PROXY'] = 'http://127.0.0.1:7890'\n",
    "os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'\n",
    "```\n",
    "\n",
    "- The same proxy settings also work for `huggingface-cli`:\n",
    "\n",
    "    ```\n",
    "    $ export HTTP_PROXY='http://127.0.0.1:7890'\n",
    "    $ export HTTPS_PROXY='http://127.0.0.1:7890'\n",
    "    ```"
   ]
  },
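  {
   "cell_type": "markdown",
   "id": "a7d3c1e9",
   "metadata": {},
   "source": [
    "If the proxy should apply only to part of a session, the variables can be set temporarily and restored afterwards. A minimal sketch, assuming the same local proxy address as above; `temporary_proxy` is a hypothetical helper name, not part of any Hugging Face library.\n",
    "\n",
    "```python\n",
    "import os\n",
    "from contextlib import contextmanager\n",
    "\n",
    "\n",
    "@contextmanager\n",
    "def temporary_proxy(url='http://127.0.0.1:7890'):\n",
    "    # Remember the previous values so they can be restored afterwards\n",
    "    keys = ('HTTP_PROXY', 'HTTPS_PROXY')\n",
    "    previous = {key: os.environ.get(key) for key in keys}\n",
    "    os.environ.update({key: url for key in keys})\n",
    "    try:\n",
    "        yield\n",
    "    finally:\n",
    "        for key, value in previous.items():\n",
    "            if value is None:\n",
    "                os.environ.pop(key, None)\n",
    "            else:\n",
    "                os.environ[key] = value\n",
    "\n",
    "\n",
    "with temporary_proxy():\n",
    "    # Requests started inside this block see the proxy settings\n",
    "    print(os.environ['HTTPS_PROXY'])\n",
    "```"
   ]
  },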
  {
   "cell_type": "markdown",
   "id": "63ff308c",
   "metadata": {},
   "source": [
    "## huggingface-cli"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46d88efb",
   "metadata": {},
   "source": [
    "```\n",
    "# Show which account is currently logged in\n",
    "$ huggingface-cli whoami\n",
    "# Log in with an access token\n",
    "$ huggingface-cli login\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6e26527",
   "metadata": {},
   "source": [
    "## cache dir"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ebf142b",
   "metadata": {},
   "source": [
    "- https://stackoverflow.com/questions/63312859/how-to-change-huggingface-transformers-default-cache-directory\n",
    "- Edit `~/.zshrc` directly (e.g. `vim ~/.zshrc`) and add:\n",
    "    - `export HF_HOME='/media/whaow/.cache/huggingface'`\n",
    "- Loading local files\n",
    "    - `AutoTokenizer.from_pretrained('./xx', cache_dir=\"~/cache/\")`"
   ]
  },
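  {
   "cell_type": "markdown",
   "id": "b2e4f6a8",
   "metadata": {},
   "source": [
    "The same setting can also be applied from Python for a single session. A minimal sketch, assuming the variable is set before any Hugging Face library is imported (they read it at import time); the path is the one from the note above, and the fallback is the usual default location when `HF_HOME` is not set.\n",
    "\n",
    "```python\n",
    "import os\n",
    "\n",
    "# Set HF_HOME before importing transformers / datasets\n",
    "os.environ['HF_HOME'] = '/media/whaow/.cache/huggingface'\n",
    "\n",
    "# Resolve the directory that will be used (falls back to the usual default)\n",
    "default_home = os.path.join(os.path.expanduser('~'), '.cache', 'huggingface')\n",
    "hf_home = os.environ.get('HF_HOME', default_home)\n",
    "print(hf_home)\n",
    "```"
   ]
  },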
  {
   "cell_type": "markdown",
   "id": "56a56d21",
   "metadata": {},
   "source": [
    "## `load_dataset`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9889d704",
   "metadata": {},
   "source": [
    "```\n",
    "from datasets import load_dataset\n",
    "\n",
    "# CSV\n",
    "load_dataset('csv', data_files='my_file.csv')\n",
    "\n",
    "# plain text (the builder name is 'text', not 'txt')\n",
    "load_dataset('text', data_files='my_file.txt')\n",
    "\n",
    "# JSON\n",
    "load_dataset('json', data_files='my_file.json')\n",
    "\n",
    "# Parquet\n",
    "load_dataset('parquet', data_files='my_file.parquet')\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80c39c49",
   "metadata": {},
   "source": [
    "```\n",
    "# Download the file first (notebook shell command)\n",
    "!wget https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\n",
    "\n",
    "from datasets import load_dataset\n",
    "\n",
    "# Load the downloaded copy; the file is semicolon-separated\n",
    "local_csv_dataset = load_dataset(\"csv\", data_files=\"winequality-white.csv\", sep=\";\")\n",
    "local_csv_dataset[\"train\"]\n",
    "\n",
    "# Load the dataset from the URL directly\n",
    "dataset_url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv\"\n",
    "remote_csv_dataset = load_dataset(\"csv\", data_files=dataset_url, sep=\";\")\n",
    "remote_csv_dataset\n",
    "```"
   ]
  },
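  {
   "cell_type": "markdown",
   "id": "c9d1e3f5",
   "metadata": {},
   "source": [
    "To try the `sep` argument without downloading anything, a small semicolon-separated file can be written locally first. A minimal sketch, assuming the `datasets` library is installed; the file name `tiny.csv` and its columns are made up for illustration.\n",
    "\n",
    "```python\n",
    "import csv\n",
    "\n",
    "from datasets import load_dataset\n",
    "\n",
    "# Write a tiny semicolon-separated file (illustrative data only)\n",
    "rows = [('acidity', 'quality'), (7.0, 6), (6.3, 5)]\n",
    "with open('tiny.csv', 'w', newline='') as f:\n",
    "    csv.writer(f, delimiter=';').writerows(rows)\n",
    "\n",
    "# Files passed via data_files are placed in a single 'train' split by default\n",
    "tiny = load_dataset('csv', data_files='tiny.csv', sep=';')\n",
    "print(tiny['train'].column_names)\n",
    "```"
   ]
  },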
  {
   "cell_type": "markdown",
   "id": "1cd7a8be",
   "metadata": {},
   "source": [
    "### Loading local files"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20e0df45",
   "metadata": {},
   "source": [
    "- Download dataset files from the Hugging Face Hub to local disk with `git clone`\n",
    "    - git lfs: faster, e.g. for `transformersbook/codeparrot`\n",
    "\n",
    "```\n",
    "from datasets import load_dataset_builder\n",
    "# Load from a local directory\n",
    "dataset_builder = load_dataset_builder('path/dir')\n",
    "# Inspect the cache directory; the data only needs to be cached once and can then be read directly\n",
    "dataset_builder.cache_dir\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
